Some Results on the Robustness of Latent Trait Models


Ronald K. Hambleton and Linda L. Cook
University of Massachusetts, Amherst


Abstract


The purpose of the present research was to study, systematically, the "goodness-of-fit" of the one-, two-, and three-parameter logistic models. We studied, using computer-simulated test data, the effects of four variables: variation in item discrimination parameters, the average value of the pseudo-chance level parameters, test length, and the shape of the ability distribution. Artificial or simulated data representing departures of varying degrees from the assumptions of the three-parameter logistic test model were generated and the "goodness-of-fit" of the three test models to the data was studied.

From the data sets analyzed in the study, it is clear that there are some sizable gains to be expected with modest length tests (n = 20) in the correct ordering of examinees at the lower end of the ability continuum when three-parameter model estimates are used (as opposed to the number right score). The gains were cut roughly in half when the tests were doubled (n = 40) in length. Item discrimination parameters as scoring weights had very little effect on the results.




The topic of latent trait theory was introduced to educational measurement specialists over 25 years ago by Frederic Lord (1952, 1953). Until recently, his work and the work of other psychometricians in the latent trait theory field received only limited attention from test practitioners. However, important breakthroughs recently in problem areas such as test score equating, tailored testing, test design and test evaluation through applications of latent trait theory have attracted considerable interest from measurement specialists. Other factors that have contributed to the current interest in latent trait theory include the availability of a number of useful computer programs, publication of a variety of successful applications in measurement journals, and the strong endorsement of the field by authors of the last three reviews of test theory in the Annual Review of Psychology. Another contributor to the current interest and popularity of the topic is the fact that the Journal of Educational Measurement published six invited papers on latent trait theory and applications in the summer issue of 1977 (see, for example, Hambleton and Cook, 1977; Lord, 1977; Wright, 1977).


All of the latent trait models of interest in this paper (the one-, two-, and three-parameter logistic test models) rest on one important assumption: For practical reasons it is usually assumed the items are homogeneous in the sense they measure the same single ability. From there, users must specify the mathematical form of the "item characteristic curves." An item characteristic curve represents the probability of a correct answer to an item expressed as a function of ability. In the one-parameter model, items may vary in difficulty levels only; in the two-parameter model, items may vary both in level of difficulty and discrimination; and in the three-parameter model, items may vary in level of difficulty, discrimination, and pseudo-chance levels. The mathematical form of the three-parameter logistic curve is written as


$$P_g(\theta) = c_g + (1 - c_g)\,\frac{e^{D a_g(\theta - b_g)}}{1 + e^{D a_g(\theta - b_g)}}, \qquad g = 1, 2, \ldots, n.$$


In this expression, $P_g(\theta)$ is the probability that an examinee with ability $\theta$ answers item g correctly; $b_g$ is the index of item difficulty; $a_g$ is the index of item discrimination; and $c_g$ is the pseudo-chance level. The reader is referred to Hambleton and Cook (1977) for a more detailed discussion of the item parameters. It should be noted that the item characteristic curves can be applied to binary scored items administered under non-speeded test conditions. The two-parameter model is obtained from the three-parameter model by setting $c_g = 0$. The one-parameter model is obtained from the three-parameter model by setting $c_g = 0$ and $a_g$ = a constant, g = 1, 2, ..., n.
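
As a minimal sketch, the three item characteristic curves can be written as one Python function, with the two simpler models obtained as special cases exactly as described above. The function names and the value D = 1.7 for the scaling constant are assumptions made for illustration; the text does not state the value of D.

```python
import math

D = 1.7  # scaling constant; assumed value, not stated in the text


def icc_3pl(theta, a, b, c, d=D):
    """Three-parameter logistic curve: c + (1 - c) * e^z / (1 + e^z)."""
    z = d * a * (theta - b)
    # e^z / (1 + e^z) is rewritten as 1 / (1 + e^-z); the two are identical.
    return c + (1.0 - c) / (1.0 + math.exp(-z))


def icc_2pl(theta, a, b, d=D):
    """Two-parameter curve: the three-parameter curve with c = 0."""
    return icc_3pl(theta, a, b, 0.0, d)


def icc_1pl(theta, b, a_common=1.0, d=D):
    """One-parameter curve: c = 0 and one discrimination shared by all items."""
    return icc_3pl(theta, a_common, b, 0.0, d)


# Probability of a correct answer when ability equals item difficulty.
p_at_difficulty = icc_3pl(theta=0.0, a=1.12, b=0.0, c=0.25)
```

When ability equals item difficulty, the three-parameter curve returns $c_g + (1 - c_g)/2$, so an item with a pseudo-chance level of .25 is answered correctly with probability .625 at that point, whatever its discrimination.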

While the potential usefulness of latent trait models is high, there remain many practical problems to address at the application stage. For one, how does a user go about selecting a latent trait model? One might be tempted to say that the user should always work with the more general models since these models will provide the "best" fits to available test data. Unfortunately, the more general latent trait models (for example, the three-parameter logistic test model) require more computer time to obtain satisfactory solutions, require larger samples of examinees and longer tests, and are more difficult for practitioners to work with. Clearly, more needs to be known about the "goodness-of-fit" and "robustness" of latent trait models. Such information would aid practitioners in the important step of selecting a test model.


There has been some work on the "goodness-of-fit" between latent trait models and a variety of test data sets (see, for example, Lord, 1975; Tinsley and Dawis, 1977; and Wright, 1968) and generally the results have been good (Hambleton, Swaminathan, Cook, Eignor, and Gifford, 1978). Only one study we have seen compared the fit of more than one latent trait model to the same test data sets (Hambleton and Traub, 1973). In this study, improvements were obtained in predicting test score distributions (for three tests) from the two-parameter model as compared to the one-parameter model.


On the question of model robustness (i.e., the extent to which the assumptions underlying the test models can be violated to a greater or lesser extent by the test data and still be "fitted" by the model), the results of several studies have been reported (Dinero and Haertel, 1977; Hambleton, 1969; Hambleton and Traub, 1976; Panchapakesan, 1969). The results have been mixed, perhaps because of the confounding of results with sample sizes.
The problem as we see it with most of the goodness-of-fit studies and the robustness studies reported to date is that they provide no indication of the practical consequences of fitting a "less than perfect" model to a test data set. It is of little interest to the practitioner to know that 15 out of 20 items failed to be fitted by a test model when the range of discrimination parameters reached (say) a value of .80. For one thing, if the size of the examinee sample is large enough, virtually all items could be identified by a chi-square statistic of goodness-of-fit as not fitting the model. If the size of the examinee sample is small enough, virtually none of the items would be misfit by the model! We think it would be interesting for practitioners to see comparisons of latent trait models and their "fit" to various data sets using a criterion measure (or measures) that have some practical meaning to them. To date there have been no comparative studies of the various latent trait models using practical criteria to judge the results.
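
The dependence of a chi-square fit statistic on sample size can be illustrated with a small arithmetic sketch. Everything specific here is a hypothetical assumption chosen for illustration: the three equal-sized ability groups, the observed and model-predicted proportions correct, the function name, and the use of 3 degrees of freedom. It is not the statistic used in any of the studies cited.

```python
# Hypothetical proportions correct on one item in three ability groups.
OBSERVED = [0.30, 0.52, 0.74]   # what the data show
EXPECTED = [0.27, 0.50, 0.77]   # what the fitted model predicts


def pearson_chi_square(n_examinees, observed_p, expected_p):
    """Pearson chi-square for one item over equal-sized ability groups.

    Each group contributes a 'correct' and an 'incorrect' cell; the two
    cells together reduce to n * (o - e)^2 / (e * (1 - e)).
    """
    per_group = n_examinees / len(observed_p)
    total = 0.0
    for o, e in zip(observed_p, expected_p):
        total += per_group * (o - e) ** 2 / (e * (1.0 - e))
    return total


# Same misfit in proportion terms, two very different sample sizes.
chi_small_sample = pearson_chi_square(150, OBSERVED, EXPECTED)
chi_large_sample = pearson_chi_square(15000, OBSERVED, EXPECTED)

# .05 critical value of chi-square with 3 degrees of freedom (df assumed).
CRITICAL_VALUE = 7.815
```

Because the statistic is proportional to the number of examinees when the discrepancies in proportions are held fixed, the same item can pass with 150 examinees and be flagged with 15,000, which is the point made in the paragraph above.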


Purposes of the Research

The purpose of the present research was to study, systematically, the "goodness-of-fit" of the one-, two-, and three-parameter logistic models. We studied, using computer-simulated test data, the effects of four variables: variation in item discrimination parameters, the average value of the pseudo-chance level parameters, test length, and the shape of the ability distribution. Artificial or simulated data representing departures of varying degrees from the assumptions of the three-parameter logistic test model were generated and the "goodness-of-fit" of the three test models to the data was studied.

How should "goodness-of-fit" be measured? It seemed to us that, in some testing situations (for example, some situations involving norm-referenced tests), test users desire to rank examinees based on their test score performance in a way that will closely reflect rankings based on examinees' "true scores." Much effort is made by test developers to rank examinees properly (i.e., "validly") by using suitably long tests, well-constructed test items, proper test conditions and so on. Utilizing the two- and three-parameter models with many test data sets will also be helpful in accomplishing the stated goal of ranking examinees in a way that will be consistent with rankings based on "true" ability scores.


In this study, because we used simulated data, it was possible to obtain all examinee "true" scores. They served as our criterion against which to judge the statistics derived from the three test models for ranking examinees. Three statistics, derived from the one-, two-, and three-parameter logistic models, respectively, were obtained and used to rank examinees. The rankings of examinees derived from each model (for each set of test data) were then compared to examinee "true" abilities. The Spearman rank difference formula was used to summarize the similarity between each pair of ranks (true abilities and estimates of ability from one of the models). We also reported the average size of the discrepancies in the ranks for each group of 500 examinees.
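
The two summary measures just described can be sketched as follows. The rank-difference formula is the familiar $1 - 6\sum d^2 / [N(N^2 - 1)]$, where d is the difference between an examinee's two ranks. Giving tied scores the average of the ranks they occupy is an assumption of this sketch, since the handling of ties is not described here, and the function names are illustrative.

```python
def midranks(values):
    """Rank values from 1 to N; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rank_difference(x, y):
    """Spearman coefficient from the rank-difference formula."""
    rx, ry = midranks(x), midranks(y)
    n = len(x)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(rx, ry))
    return 1.0 - 6.0 * sum_d2 / (n * (n * n - 1))


def average_absolute_rank_difference(x, y):
    """Mean of |rank on x - rank on y| over examinees."""
    rx, ry = midranks(x), midranks(y)
    return sum(abs(p - q) for p, q in zip(rx, ry)) / len(x)
```

Here `x` would hold the true abilities and `y` one of the three model statistics for the same examinees, in the same order.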


As an aside, we note that it would have been desirable also to compare ability estimates, denoted $\hat{\theta}$, and true ability scores, denoted $\theta$. Unfortunately, because of the arbitrariness of the scale on which $\hat{\theta}$ is measured, it would have been of very limited value to report summary statistics such as $\sum_{i=1}^{N} |\hat{\theta}_i - \theta_i| / N$. In some of our later work we will address the scaling problem through equating methods.
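
The arbitrariness of the ability scale can be seen in a short sketch: a linear rescaling of ability, matched by a compensating change in the item parameters, leaves every response probability unchanged, so the data cannot distinguish the two scales. The parameter values, the rescaling constants k and m, and the value D = 1.7 are illustrative assumptions.

```python
import math


def icc_3pl(theta, a, b, c, d=1.7):
    """Three-parameter logistic curve (d = 1.7 is an assumed scaling constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))


theta, a, b, c = 0.4, 1.12, -0.3, 0.25   # illustrative values
k, m = 2.0, 5.0                          # arbitrary rescaling: theta* = k * theta + m

p_original = icc_3pl(theta, a, b, c)
# Rescale ability and difficulty together and divide discrimination by k.
p_rescaled = icc_3pl(k * theta + m, a / k, k * b + m, c)

# The probabilities agree, yet the ability values differ by this much:
scale_gap = abs((k * theta + m) - theta)
```

Since `p_original` and `p_rescaled` coincide while the ability values themselves are far apart, an average absolute difference between estimated and true abilities depends on which scale the estimation program happens to fix, whereas rank order does not.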


Method

Simulating the Test Data
The simulation of item response data for examinees was accomplished using the three-parameter logistic model. First, the number of examinees (N), shape of the ability distribution, and values of the ability parameters ($\theta_i$, i = 1, 2, ..., N) were specified. Next, the number of items in the test (n) and values of the three item parameters ($a_g$, $b_g$, $c_g$, g = 1, 2, ..., n) were specified. Then the examinee and item parameters were substituted in the equation of the three-parameter logistic model to obtain a number $P_{ij}$ ($0 < P_{ij} < 1$) representing the probability that examinee i correctly answered item j. The probabilities were arranged in a matrix P of order N x n whose (i, j)th element was $P_{ij}$. P was then converted into a matrix of the item scores for examinees (1 = correct answer, 0 = incorrect answer) by comparing each $P_{ij}$ with a random number obtained from a uniform distribution on the interval [0, 1]. If the random number was less than or equal to $P_{ij}$ (which would happen on the average $P_{ij}$ of the time), $P_{ij}$ was set equal to 1; otherwise $P_{ij}$ was set to 0. The matrix P of zeros and ones was the simulated test data. At this point, three statistics used in estimating examinee ability were calculated: $\sum_{g=1}^{n} u_g$, $\sum_{g=1}^{n} a_g u_g$, and $\sum_{g=1}^{n} w_g(\theta) u_g$, corresponding to statistics which are used in the estimation of examinee ability with the one-, two-, and three-parameter models, respectively. (Recall, $u_g$ = 1 for a correct response, $u_g$ = 0 otherwise.) For the three-parameter model statistic, since the item weights [$w_g(\theta)$] depend on examinee ability, we obtained three-parameter model estimates of ability for each examinee from LOGIST (Wood, Wingersky, and Lord, 1976).¹ Once we had calculated the three-parameter model estimates of ability, we used them (instead of $\sum_{g=1}^{n} w_g(\theta) u_g$) for convenience.
¹ There has been some discussion by practitioners of the difficulties of using LOGIST, and the costs involved. We were able to install the program very quickly on our CYBER 70 System and the cost of typical runs in our study (20 or 40 items, 500 examinees) was about $2.00. We should add that these results were obtained for the case where item parameters are known.
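
The simulation steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the program used in the study: the function names, the seed, the value D = 1.7, and the particular parameter draws are hypothetical, and only the number right and discrimination-weighted statistics are computed, since the three-parameter ability estimates came from LOGIST, which is not reproduced here.

```python
import math
import random


def icc_3pl(theta, a, b, c, d=1.7):
    """Three-parameter logistic curve (d = 1.7 is an assumed scaling constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))


def simulate_responses(thetas, items, rng):
    """Return an N x n matrix of 0/1 item scores.

    items is a list of (a, b, c) tuples; rng is a random.Random instance.
    A response is scored 1 when a uniform random number is <= P_ij.
    """
    data = []
    for theta in thetas:
        row = []
        for a, b, c in items:
            p = icc_3pl(theta, a, b, c)
            row.append(1 if rng.random() <= p else 0)
        data.append(row)
    return data


def number_right(row):
    """One-parameter model statistic: sum of the item scores."""
    return sum(row)


def discrimination_weighted(row, items):
    """Two-parameter model statistic: item scores weighted by a_g."""
    return sum(a * u for (a, _, _), u in zip(items, row))


rng = random.Random(1978)  # arbitrary seed, for repeatability only
thetas = [rng.uniform(-2.5, 2.5) for _ in range(500)]
items = [(rng.uniform(0.50, 1.74), rng.uniform(-2.0, 2.0), 0.25) for _ in range(20)]
data = simulate_responses(thetas, items, rng)
```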


The values of the examinee and item parameters were chosen as follows:

Examinee Parameters. The number of examinees was set equal to 500. This number was sufficient to produce stable goodness-of-fit results. Two distributions of ability were considered: Uniform [-2.5, 2.5] and Normal [0, 1].

Item Parameters. Two test lengths (20 and 40 items) were used in the simulations. Both values are fairly typical of test lengths in common use.


In the simulation of test data, item difficulty parameters, $b_g$, g = 1, 2, ..., n, were selected at random from a uniform distribution on the interval [-2, 2]. An analysis of the difficulty parameters reported by Lord (1968) suggested that this decision was reasonable.

The discrimination parameters, $a_g$, g = 1, 2, ..., n, for the items of a simulated test were selected at random from a uniform distribution with mean = 1.12. The range of the discrimination parameters was a variable under investigation. The range was varied from 0.0 to a maximum of 1.24 [.50 to 1.74], and an intermediate value of .62 [.81 to 1.43] was also studied. The maximum value of discrimination was similar to the range and distribution of the discrimination parameters reported for the Verbal Section of the SAT (Lord, 1968).

The extent of guessing in the simulated test data was another variable under study. Two values of the average guessing parameter were considered: $\bar{c}$ = 0.00, and $\bar{c}$ = 0.25. All pseudo-chance level parameters were set equal to the mean value of the c-parameter under investigation.

Factor Structure. For all of the tests simulated in this study, it was assumed that the test items were unidimensional, i.e., measured a common trait.
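
The item parameter choices just listed can be sketched as a small generator. The function name and argument layout are hypothetical; the numerical settings (difficulties uniform on [-2, 2], discriminations uniform around a mean of 1.12 with a range of 0.0, .62, or 1.24, and a common pseudo-chance level) are taken from the description above.

```python
import random

MEAN_DISCRIMINATION = 1.12


def draw_item_parameters(n_items, a_range, c_level, rng):
    """Return a list of (a, b, c) tuples for one simulated test.

    a_range is the width of the uniform discrimination interval
    (0.0, 0.62, or 1.24 in the study); c_level is shared by all items.
    """
    half = a_range / 2.0
    items = []
    for _ in range(n_items):
        a = rng.uniform(MEAN_DISCRIMINATION - half, MEAN_DISCRIMINATION + half)
        b = rng.uniform(-2.0, 2.0)
        items.append((a, b, c_level))
    return items


rng = random.Random(0)  # arbitrary seed
equal_a_test = draw_item_parameters(20, 0.0, 0.25, rng)    # every a_g = 1.12
widest_a_test = draw_item_parameters(40, 1.24, 0.00, rng)  # a_g in [.50, 1.74]
```

With a range of 0.0 every item receives the same discrimination, which is the condition under which the one- and two-parameter scoring weights coincide.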


Goodness-of-Fit

The approach to goodness of fit was described earlier in the purposes section of the paper. For each data set (24 in total; 2 test lengths x 2 levels of pseudo-chance parameters x 3 levels of variation in discrimination parameters x 2 ability distributions), three statistics used in estimating ability for the one-, two-, and three-parameter models, respectively, were calculated and compared to the true ability parameters. Comparisons were made via the use of the Spearman rank difference formula and the average discrepancy in ranks.

To further facilitate the interpretation of results, they are reported separately for each half of the ability distribution as well as for the total ability distribution.
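
The layout of the 24 data sets and the reporting by halves can be sketched as follows. The constant and function names are hypothetical. Splitting the examinees at an ability of zero matches the ranges given in the titles of Tables 1 and 2 for the uniform case; using the same cut for the normal case is an assumption of this sketch.

```python
import random
from itertools import product

TEST_LENGTHS = (20, 40)
PSEUDO_CHANCE_LEVELS = (0.00, 0.25)
DISCRIMINATION_INTERVALS = ((1.12, 1.12), (0.81, 1.43), (0.50, 1.74))
ABILITY_DISTRIBUTIONS = ("uniform", "normal")

# Every combination of the four variables under study.
CONDITIONS = list(product(TEST_LENGTHS, PSEUDO_CHANCE_LEVELS,
                          DISCRIMINATION_INTERVALS, ABILITY_DISTRIBUTIONS))


def draw_abilities(distribution, n_examinees, rng):
    """Draw true abilities: Uniform[-2.5, 2.5] or Normal(0, 1)."""
    if distribution == "uniform":
        return [rng.uniform(-2.5, 2.5) for _ in range(n_examinees)]
    if distribution == "normal":
        return [rng.gauss(0.0, 1.0) for _ in range(n_examinees)]
    raise ValueError(distribution)


def split_halves(thetas, cut=0.0):
    """Return index lists for examinees below and at-or-above the cut."""
    lower = [i for i, t in enumerate(thetas) if t < cut]
    upper = [i for i, t in enumerate(thetas) if t >= cut]
    return lower, upper


thetas = draw_abilities("uniform", 500, random.Random(0))  # arbitrary seed
lower_half, upper_half = split_halves(thetas)
```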
Results 


The results of our computer simulations are summarized in Tables 1 to 6. The first row of each table was inserted to serve as a [word illegible]. For convenience we will discuss the results in point form around the variables under study:

Level of Variation in Discrimination Parameters

1. For the values studied in the paper, using discrimination parameters as item weights contributed very little to the proper ranking of examinees.

Level of Pseudo-Chance Level Parameters


2. With the twenty-item tests, the three-parameter model was considerably more effective at ranking examinees correctly in the lower half of the ability distribution. Correlations were about .08 higher (~.75 to ~.83) in the uniform distribution of ability and about .08 higher in the normal
Table 1

Summary of the Goodness-of-Fit Results
(Uniform Ability Distribution,¹ $\theta$ = -2.5 to 0.0)

                                                 Comparison of Estimates
        Variation in   Pseudo-                   True Versus     True Versus     True Versus
        Discrimination Chance   Test Score       One-Parameter   Two-Parameter   Three-Parameter
Test    Parameters     Level    Statistics       Model           Model           Model
Length                            Mean     SD    r²      AAD³    r        AAD    r        AAD
20      0.00           .00        5.03   3.00    .881  54.238    .881  54.238    .881  54.238
20      0.00           .25        8.98   2.86    .765  76.610    .765  76.610    .827  64.984
20      .81 to 1.43    .00        5.24   3.10    .877  56.068    .876  56.406    .876  56.404
20      .81 to 1.43    .25        9.01   2.84    .760  77.144    .764  76.900    .833  64.284
20      .50 to 1.74    .00        5.36   3.02    .874  56.496    .874  56.558    .874  56.562
20      .50 to 1.74    .25        9.12   2.83    .747  80.076    .750  79.920    .827  65.770
40      0.00           .00        9.58   6.22    .944  36.482    .944  36.482    .944  36.482
40      0.00           .25       17.82   5.33    .868  58.578    .868  58.578    .908  48.704
40      .81 to 1.43    .00       10.14   6.37    .949  36.504    .949  36.474    .949  36.474
40      .81 to 1.43    .25       17.98   5.41    .872  57.662    .875  56.860    .912  48.014
40      .50 to 1.74    .00        9.97   6.39    .942  37.862    .946  36.962    .946  36.742
40      .50 to 1.74    .25       18.18   5.41    .870  57.824    .876  56.872    .910  48.222

¹ N = 500
² Spearman Rank-Difference Formula
³ Average absolute difference in rank order
Table 2

Summary of the Goodness-of-Fit Results
(Uniform Ability Distribution,¹ $\theta$ = 0.00 to +2.5)

                                                 Comparison of Estimates
        Variation in   Pseudo-                   True Versus     True Versus     True Versus
        Discrimination Chance   Test Score       One-Parameter   Two-Parameter   Three-Parameter
Test    Parameters     Level    Statistics       Model           Model           Model
Length                            Mean     SD    r²      AAD³    r        AAD    r        AAD
20      0.00           .00       14.99   2.82    .883  54.450    .877  55.624    .877  55.624
20      0.00           .25       16.21   2.13    .835  63.676    .828  65.350    .829  65.726
20      .81 to 1.43    .00       15.12   2.75    .891  52.234    --    55.376    .881  55.382
20      .81 to 1.43    .25       16.16   2.14    .847  63.802    --    65.018    .841  63.190
20      .50 to 1.74    .00       14.93   2.79    .872  56.988    .882  55.384    .882  55.470
20      .50 to 1.74    .25       16.36   2.09    .797  71.570    .797  70.720    .804      --
40      0.00           .00       31.73   5.55    .940  39.034    .936  40.496    .936  40.496
40      0.00           .25       33.52   4.37    .903  50.188    .898  51.046    .896  50.852
40      .81 to 1.43    .00       31.30   5.53    .935  40.648    .932  41.832    .932  41.848
40      .81 to 1.43    .25       33.47   4.26    .908  49.142    .903  50.554    .905  50.266
40      .50 to 1.74    .00       31.15   5.39    .934  40.788    .939  38.932    .939  38.940
40      .50 to 1.74    .25       33.40   4.16    .890  52.882    .892  52.898    .893  52.678

¹ N = 500
² Spearman Rank-Difference Formula
³ Average absolute difference in rank order
-- Entry illegible
Table 3

Summary of the Goodness-of-Fit Results
(Uniform Ability Distribution,¹ $\theta$ = -2.5 to +2.5)

                                                 Comparison of Estimates
        Variation in   Pseudo-                   True Versus     True Versus     True Versus
        Discrimination Chance   Test Score       One-Parameter   Two-Parameter   Three-Parameter
Test    Parameters     Level    Statistics       Model           Model           Model
Length                            Mean     SD    r²      AAD³    r        AAD    r        AAD
20      0.00           .00        9.91   5.84    .970  28.264    .970  28.368    .970  28.368
20      0.00           .25       12.40   4.43    .932  41.850    .93   41.972    .949  36.968
20      .81 to 1.43    .00        9.97   5.63    .969  28.808    .969  29.138    .969  29.149
20      .81 to 1.43    .25       12.26   4.35    .931  42.402    --    43.932    .943  38.594
20      .50 to 1.74    .00       10.50   5.58    .965  30.826    .966  30.140    .966      --
20      .50 to 1.74    .25       12.40   4.54    .932  42.200    --    42.726    .942  39.016
40      0.00           .00       20.99  12.21    .984  20.438    .984  20.614    .984  20.614
40      0.00           .25       24.54   9.40    .964  30.130    .964  30.260    .971  27.018
40      .81 to 1.43    .00       20.31  12.54    .983  21.088    .983  21.250    .983  21.254
40      .81 to 1.43    .25       24.58   9.36    .962  30.690    .962  30.750    .971  27.738
40      .50 to 1.74    .00       19.93  12.12    .981  22.478    .982  21.814    .982  21.808
40      .50 to 1.74    .25       24.94   9.16    .962  31.490    .964  30.498    .972  27.302

¹ N = 500
² Spearman Rank-Difference Formula
³ Average absolute difference in rank order
-- Entry illegible
Table 4

Summary of the Goodness-of-Fit Results
(Lower Half of Normal Ability Distribution,¹ $\bar{X}_\theta$ = 0.00, $SD_\theta$ = 1.00)

                                                 Comparison of Estimates
        Variation in   Pseudo-                   True Versus     True Versus     True Versus
        Discrimination Chance   Test Score       One-Parameter   Two-Parameter   Three-Parameter
Test    Parameters     Level    Statistics       Model           Model           Model
Length                            Mean     SD    r²      AAD³    r        AAD    r        AAD
20      0.00           .00        6.77   2.69    .817  65.584    .817  65.584    .817  65.584
20      0.00           .25       10.04   2.54    .649  94.928    .649  94.928    .736  82.536
20      .81 to 1.43    .00        6.72   2.66    .835  62.716    .830  63.262    .830  63.312
20      .81 to 1.43    .25       10.10   2.56    .653  95.184    .645  95.774    .729  83.486
20      .50 to 1.74    .00          --     --    .796  70.646    --    69.428    .801  69.414
20      .50 to 1.74    .25       10.25   2.57    .655  94.628    --    95.800    .725  83.380
40      0.00           .00       13.61   5.48    .909  46.026    .909  46.026    .909  46.026
40      0.00           .25       20.06   4.78    .813  68.700    .813  68.700    .848  61.626
40      .81 to 1.43    .00       13.65   5.55    .903  48.234    --    47.276    .907  47.280
40      .81 to 1.43    .25       20.19   4.86    .810  68.078    --    67.048    .852  60.094
40      .50 to 1.74    .00       14.29   5.78    .901  48.218    --    46.580    .909  46.582
40      .50 to 1.74    .25       20.47   4.90    .805  69.010    --    68.662    .848  61.578

¹ N = 500
² Spearman Rank-Difference Formula
³ Average absolute difference in rank order
-- Entry illegible or missing
Table 5

Summary of the Goodness-of-Fit Results
(Upper Half of Normal Ability Distribution,¹ $\bar{X}_\theta$ = 0.00, $SD_\theta$ = 1.00)

                                                 Comparison of Estimates
        Variation in   Pseudo-                   True Versus     True Versus     True Versus
        Discrimination Chance   Test Score       One-Parameter   Two-Parameter   Three-Parameter
Test    Parameters     Level    Statistics       Model           Model           Model
Length                            Mean     SD    r²      AAD³    r        AAD    r        AAD
20      0.00           .00       13.37   2.62    .844  60.506    .844  60.808    .844  60.808
20      0.00           .25       15.12   2.20    .761  75.752    .759  76.158    .769  75.076
20      .81 to 1.43    .00       13.37   2.61    .853  61.088    .852  61.596    .852  61.606
20      .81 to 1.43    .25       15.12   2.18    --    76.406    .757  78.024    .769  75.628
20      .50 to 1.74    .00       13.43   2.52    .834  64.792    .846  63.084    .846  63.076
20      .50 to 1.74    .25       15.11   2.12    .749  78.686    --    79.920    .767  77.012
40      0.00           .00       27.96   4.93    .895  50.714    .895  50.748    .895  50.748
40      0.00           .25       31.02   3.75    .823  65.180    .822  65.448    .833  64.236
40      .81 to 1.43    .00       28.28     --    .894  51.252    .898  50.212    .898  50.226
40      .81 to 1.43    .25          --   3.81    .824  65.924    .830  64.838    .839  63.160
40      .50 to 1.74    .00       28.39   4.90    .892  51.014    .898  49.954    .898  49.952
40      .50 to 1.74    .25       31.20   3.77    .808  67.604    .822  64.512    .828  63.958

¹ N = 500
² Spearman Rank-Difference Formula
³ Average absolute difference in rank order
-- Entry illegible
Table 6

Summary of the Goodness-of-Fit Results
(Normal Ability Distribution,¹ $\bar{X}_\theta$ = 0.0, $SD_\theta$ = 1.0)

                                                 Comparison of Estimates
        Variation in   Pseudo-                   True Versus     True Versus     True Versus
        Discrimination Chance   Test Score       One-Parameter   Two-Parameter   Three-Parameter
Test    Parameters     Level    Statistics       Model           Model           Model
Length                            Mean     SD    r²      AAD³    r        AAD    r        AAD
20      0.00           .00       10.30   4.27    .940  36.844    .940  36.906    .940  36.906
20      0.00           .25       12.37   3.49    .883  53.940    .883  53.896    .908  47.554
20      .81 to 1.43    .00       10.43   4.33    .943  35.868    .944  35.988    .944  35.982
20      .81 to 1.43    .25       12.40   3.46    .882  54.306    .883  54.336    .905  48.610
20      .50 to 1.74    .00       10.51   4.20    .930  41.114    .932  40.958    .932  40.962
20      .50 to 1.74    .25       12.48   3.50    .873  55.726    .865  57.942    .881  52.120
40      0.00           .00       21.22   9.21    .971  26.598    .971  26.620    .971  26.620
40      0.00           .25       25.78   7.11    .946  36.442    .946  36.464    .956  33.030
40      .81 to 1.43    .00       20.90   9.39    .973  25.196    .973  25.536    .973  25.534
40      .81 to 1.43    .25       25.88   7.01    .939  38.864    .942  37.648    .952  34.148
40      .50 to 1.74    .00       20.87   8.99    .970  27.038    .972  25.878    .972  25.874
40      .50 to 1.74    .25       25.91   6.99    .937  38.794    .941  37.330    .951  34.676

¹ N = 500
² Spearman Rank-Difference Formula
³ Average absolute difference in rank order
distribution (~.65 to ~.73). The improvement in the average absolute difference in rank order was about 13.


3. With the forty-item tests, the three-parameter model was also somewhat more effective at ranking examinees correctly in the lower half of the ability distribution. Correlations were about .04 higher in both ability distributions. The improvement in the average absolute difference in rank order was about 8. The reduction in effectiveness of the three-parameter model weights was to be expected with the longer tests. Gulliksen (1950) noted the insignificance of scoring weights when the test gets longer and test items are positively correlated.

4. For examinees in the upper half of the ability distribution, and for the data sets studied, the number right score was about as effective as the more complicated scoring weights used in the two- and three-parameter models.

Shape of the Ability Distribution

5. As expected, correlations tended to be higher with the uniformly distributed ability scores.

Test Length

6. It is interesting to observe the increases in correlations due to doubling the length of the test. Again, as expected, they tended to be rather small.
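
The size of the gains described in points 2 and 3 can be checked with a short calculation. The values below are transcribed from the rows of Tables 1 and 4 with a pseudo-chance level of .25 (lower half of each ability distribution); the dictionary layout and function name are hypothetical, and the averaging over the three discrimination conditions is our own summary, not a figure reported in the tables.

```python
# Each tuple: (one-parameter r, three-parameter r,
#              one-parameter AAD, three-parameter AAD)
LOWER_HALF_C25 = {
    ("uniform", 20): [(.765, .827, 76.610, 64.984),
                      (.760, .833, 77.144, 64.284),
                      (.747, .827, 80.076, 65.770)],
    ("uniform", 40): [(.868, .908, 58.578, 48.704),
                      (.872, .912, 57.662, 48.014),
                      (.870, .910, 57.824, 48.222)],
    ("normal", 20): [(.649, .736, 94.928, 82.536),
                     (.653, .729, 95.184, 83.486),
                     (.655, .725, 94.628, 83.380)],
    ("normal", 40): [(.813, .848, 68.700, 61.626),
                     (.810, .852, 68.078, 60.094),
                     (.805, .848, 69.010, 61.578)],
}


def mean_gains(rows):
    """Average gain in r and average reduction in AAD, three- over one-parameter."""
    r_gain = sum(r3 - r1 for r1, r3, _, _ in rows) / len(rows)
    aad_gain = sum(d1 - d3 for _, _, d1, d3 in rows) / len(rows)
    return r_gain, aad_gain


summary = {key: mean_gains(rows) for key, rows in LOWER_HALF_C25.items()}
```

Comparing `summary[("uniform", 40)]` with `summary[("uniform", 20)]`, and likewise for the normal case, shows the gain in correlation at forty items to be a little over half of the gain at twenty items.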

Conclusions

From the data sets analyzed in this study, it is clear that there are some sizable gains to be expected with modest length tests (n = 20) in the correct ordering of examinees at the lower end of the ability continuum when three-parameter model estimates are used (as opposed to the number right score). The gains were cut roughly in half when the tests were doubled (n = 40) in length. It was also surprising (to us) that item discrimination parameters as weights had so little effect on the results. On the other hand, Gulliksen (1950) had summarized the research on item weights nearly thirty years ago and came to essentially the same conclusion. This brings us to what we feel is a very important point. To the extent that our simulated data sets are typical of real data, it would appear that the application of latent trait models to the problem of "ranking" examinees is probably not worth the trouble except in those situations where gains of the size noted for lower ability examinees in the study are important. The number right score does nearly as good a job of ranking examinees as the more complicated scoring methods.
We do caution the reader, however, against generalizing the results from a single study. For one, the authors have not had enough experience fitting the three-parameter model to real data to feel sure about the "typical" values of the item parameters. It is possible that our simulations do not closely reflect real data. Second, our criterion measure of goodness of fit seems suitable for the situation in which a user desires to make norm-referenced interpretations of his/her test scores. There are many other test situations (for example, those involving tailored tests, test score equating, and criterion-referenced tests) where a different criterion to judge the quality of a solution would be more suitable. Third, the results of our study provide a somewhat unfair comparison of the two-parameter model with the other two models. This is because the item discrimination parameters used in the weighting process to derive statistics for ability estimation would have been somewhat different had the "best-fitting" two-parameter curves to the three-parameter item characteristic curves been used. The item discrimination parameters in the "best-fitting" two-parameter curves would have differed somewhat from those defined in the three-parameter curves they were fitted to.
A final point should also be stressed. The correlation results of the one-parameter model and (to a much lesser extent) the two-parameter model are inflated (to an unknown extent) because of tied scores. Therefore, the true differences in the various correlations are somewhat larger than those reported in Tables 1 to 6. This error in our methodology will be corrected before we prepare our paper for publication.
In summary, the future of latent trait theory as a framework for solving educational testing problems has been firmly established. There have already been major breakthroughs in important areas of testing through the use of latent trait models. It is our hope that our methods and results will encourage others to seek to define and to use other practical criteria for comparing the results of fitting latent trait models to simulated as well as real data to the extent it is possible to do so. Certainly there is substantial need for more research aimed at providing practitioners with practical guidelines for model selection, test design and test score analysis.